After briefly going through the IMDB movie dataset, one can start to notice some correlations or trends between various characterstics of the movie. The pertinant business question that any Data Analyst would ask when browsing through this data set is to find out what characterstics of movies produce the highest revenue. This investigation and reaserch would help the film industry to work on those characterstics that would generate the highest revenue and by extension increase the movies popularity
It is due to these above mentioned reasons, that I have decided to keep one major overaching question that would drive the entire investigation and from which other sub-questions and trends would be discovered.
The focus of our investigation would revolve around
"What characterstics and traits of a movie generate higher revenue". Is their a positive correlation between one characterstic and revenue and can one conclude that this correlation might also be defined as causation?
From this other subquestions can be derived and various other trends explored for example:
1. Does the budget have a direct correlation with the revenue produce. Our normal understanding would be that the higher the investment, the greater the revenue that shall be generated. We shall use the data to either support or disprove this statement.
2. Do any other characterstics support any other co-relation either positive or negative for high revenue? Our further exploratory analysis would lead us to ask more questions.
The other significatiant investigation we shall be doing is understanding trends within the dataset independent of one another to come to conclusions. Whether certain trends show a decline in a particular characterstic or help increase it.
# Importing the important libraries that we shall use in this investigation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
import matplotlib
sns.set(style="ticks", color_codes=True)
- Too many null(or '0' values in the budget_adj and revenue_adj coloumn. We cannot take weighted average as almost half of the data is missing so shall lead to inaccurate results.
- release_date, revenue_adj, budget_adj not in the correct datatype.
- Results show one duplicated values.
- Corrupt values found in 'directors' coloumn.
- Cast, Genre, production_companies contain numerous values per cell delimite by '|'. We might need to work on a mechanism to extract them
#Load dataset
# Original Dataset from https://www.kaggle.com/tmdb/tmdb-movie-metadata
df = pd.read_csv('tmdb-movies.csv')
#Data wranging. Lets investigate the data.
# df.describe()
df.info();
#I saw some zeros when browsing the data. Let me see how many they are randomly
# df['budget_adj'].head(80)
df['revenue'].head(60)
# df.head(4)
# Below shows significant amount of missing values.
print('Count of missing values(zeros) in budget_adj colm')
df[df['budget_adj'] == 0].count()
# Below shows significant amount of missing values.
print('Count of missing values(zeros) in revenue_adj colm')
df[df['revenue_adj'] == 0].count()
#Search for missing values
df.duplicated().sum()
# only one duplicate value
We'll clean and extract the relavent data over here. However, since priminarly investigations show that certain coloums(e.g revenue, budget) have a high amount of null(or '0'/Zero values). Removing them in the earlier stage would significantly reduce the size of the dataset therefore making whatever conclusions we draw less credible. We shall therfore decide to remove them(missing values) accordingly to the needs of the questions/exploratory analysis we do.
1. Removing the following coloumns:
a. Removing budget and revenue coloumns since 'revenue_adj' and 'budget_adj' is more appropriate since they take into consideration inflation.
b. Dropping the following coloumns since we do not find them relavent to the dataset. We shall keep the 'id' coloumn as a unique identifier for the row. [imdb_id, homepage, tagline, overview]
Changing the type of the following coloumn.
- a. release date to data typeRemoving duplicate (NaN) rows from the dataframe
Note that we shall further clean the dataset when we further investigate the dataset. Cleaning or removing values right now will cause a significant decrease in data
df.head(2)
# After discussing the structure of the data and any problems that need to be cleaned,
# perform those cleaning steps in the second part of this section.
# Removing budget and renvuue since revenue_adj and budget_adj is more appropriate since it considers inflation
df.drop(['imdb_id','budget', 'revenue', 'homepage'], axis=1, inplace=True)
#renaming to more descriptive coloumns.
df.rename(index=str, columns={"budget_adj": "total_budget", "revenue_adj": "total_revenue"}, inplace=True)
#changing type of date
df['release_date'] = pd.to_datetime(df['release_date'])
df.head(2)
df.info()
#removing duplicate values
x = df.duplicated().sum()
print("No of Duplicate Values", x)
df.drop_duplicates(inplace=True)
y = df.duplicated().sum()
print("No of Duplicate Values after cleaning", y)
First make a copy of the original cleaned dataframe and perform changes with regards to this question only on this dataframe so that the original remains unaffected.
Cleaning data - and removing all those values which are empty in both total_revenue and total_budget.
The removal of values reduces the dataset to 50% therefore one can question the crediblity of the analysis.
# 1 Copy the dataframe so that the main dataframe remains unaffected
df_q2 = df.copy(deep=True)
# Cleaning data - and removing all those values which are empty in both revenue and budget.
# The removal of values reduces the dataset to 50% therefore one can question the crediblity of the analysis.
# 2 remove all null values
df_q2 = df_q2[df_q2['total_revenue'] != 0]
df_q2 = df_q2[df_q2['total_budget'] != 0]
#check the data
df_q2.head(2)
# 3 a Plot of budget and revenue
sns.pointplot(x="total_budget", y="total_revenue", data=df_q2);
# sns.pointplot(x="total_revenue", y="total_budget", data=df_q2);
# 3 b Scatter plot of budget and revenue, shows better correlation
sns.regplot(x="total_budget", y="total_revenue", data=df_q2, scatter_kws={'alpha':0.3});
Plot the graph between budget and revenue. The graph/scatter plot shows that both [1] the budget and revenue are positivly correlated . As the budget increases the average revenue of a movie also increases.
Q1 - b) Similary lets see if there is any relationship between movie duration and revenue?
# 3 c Graph between populaity and revenue shows that there isn't any correlation betweeen these two values
# matplotlib.pyplot.scatter(df_q2['popularity'] ,df_q2['total_revenue']);
# matplotlib.pyplot.xlabel("Popularity")
# matplotlib.pyplot.ylabel("Revenue");
sns.pointplot(x="popularity", y="total_revenue", data=df_q2);
Lets remove the outliers present in our data.
# 4 clear outliers
df_q2_removeOutliers = df_q2[df_q2['popularity'] < 15]
#matplotlib.pyplot.scatter(df_q2_removeOutliers['popularity'] ,df_q2_removeOutliers['total_revenue']);
#matplotlib.pyplot.xlabel("Popularity")
#matplotlib.pyplot.ylabel("Revenue");
sns.pointplot(x="popularity", y="total_revenue", data=df_q2_removeOutliers);
The above graph shows that there is a significant relationship between popularity and revenue The more popularity a particular movie has the more revenue it usually tends to produce.
Q1 - c) Reputed directors traditionally do have a impact on the movie revenue. Lets first find out the directors which directed the most movies and then compare the revenues made by the movies they directed. We will find out whether a particular director directing a movie has any(if any) significant impact on the revenues.
# 5 Drop all null values.
df_q2 = df_q2.dropna()
# Sieve through the dataset by using a group by function and find out the list of directors which produced the most
# movies in descending order. Since the number of directors in this dataset is huge, we shall only consider the top few who
#have directed a greater number of movies.
top_actors = (df_q2.groupby('director')['original_title'].count()).sort_values(ascending=False)
#List of top directors aka top_actors
top_actors[:30].plot(kind='bar', figsize=(15,11), title='No. of movies directed by top 30 directors').yaxis.label.set_visible(False);
One can observe from the above graph that Steven Spielberg and Ridley Scott and Client Eastwood have directed a marked greater number of movies than the rest.
The below code first sieves through dataframe and extracts all those rows which contains the names of the top 21 directors. The for loop extracts the row from the dataframe and keeps on appended it to the df_topDirectors dataframe.
# 6 Now create a new dataframe which only contain the rows belonging to the movies produced by these directors
df_temp = pd.DataFrame()
for actor,v in top_actors[:21].items():
df_temp =df_temp.append(df_q2[df_q2['director'].str.contains(actor)],ignore_index=True)
df_topDirectors = df_temp
# 7 a convert to float datatype
df_topDirectors['total_revenue'] = df_topDirectors['total_revenue'].astype(str).astype('float64')
# 7 b take the mean of the revenue produced by all the movies directed by the director.
df_topDirectorsRevenue = df_topDirectors.groupby(['director']).mean()
# 7 c Extract the relavent coloumns needed for plotting
df_topDirectorsRevenue =df_topDirectorsRevenue.drop(df_topDirectorsRevenue.columns.difference(['director','total_revenue']), 1)
# 7 d Sort the total_revenue coloumn in descending order
df_topDirectorsRevenue = df_topDirectorsRevenue.sort_values(by='total_revenue', ascending=1)
df_topDirectorsRevenue.plot(kind='barh', figsize=(10, 10), sort_columns='True', title='Average Revenue of Movies per directors' );
#one might want to fiddle with the data and use a pie chart
df_topDirectorsRevenue[:3].plot(kind='pie', subplots=True, figsize=(10, 10), sort_columns='True', title='Average Revenue of Movies per directors', legend=False );
The bar chart above shows that directors who have directed a total of 18 movies. We can infer that the movies directed by the topmost directors from Peter Jackson to Tim Burton tend to have a greater revenue. Lets further investigate if this inference hold any crediblity by looking into the details of the top director - Peter Jackson
# 8 a Extract rows containing movies directed by Peter Jackson
Peter_Jackson=df_topDirectors[df_topDirectors['director'].str.contains('Peter Jackson')].sort_values(by='total_revenue', ascending=0)
Peter_Jackson[['id', 'director', 'original_title' ,'total_revenue']]
Lets have a look at the data in graphical format aswell
Peter_Jackson[['total_revenue']].plot(kind='bar', sort_columns='True', title='Movie revenue directed by Peter Jackson');
Lets also have a look at the movies directed by Stevin Speilburg before making some conclusions
# 8 b Extract rows containing movies directed by Steven Spielburg
StevenSp=df_topDirectors[df_topDirectors['director'].str.contains('Steven Spielberg')].sort_values(by='total_revenue', ascending=0)
StevenSp[['id', 'director', 'original_title' ,'total_revenue']]
Lets have a look at the data in graphical format aswell
StevenSp[['total_revenue']].plot(kind='bar', sort_columns='True', title='Average Revenue of Movies directed by Steven Speilburg');
Having evaluated our findings from the above dataset(tables and graphs), there isn't substancial evidence to suggest that the movies produced by such directors would generate high revenue as these table show that while some movies did produced a marked high revenue well above the global mean revenue, there were significant other movies that did not perform as well.
Next we shall see the trends in revenue over the years.
sns.set(rc={'figure.figsize':(26.7,8.27)})
sns.barplot(x="release_year", y="total_revenue", data=df_q2);
df_q2['total_revenue'].mean()
The above graph shows that the revenue(with fluctuations) was higher from the 60's to the 70's while it remained almost the same and consistant with minor fluctuations after the 90's However, it must also be noted that the during the data cleaning process alot of missing data was from 1960s to 1980, and when cleaned only few data remained which contained the revenue of the movies. This does make one consider the crediblity of this result.
So considering our inital question that 'Are there any features/properties in the data which are correlated to or effect revenue?'
We find that
However we did discover that Steven Spielberg and Peter Jackson have directed a marked greater number of movies than the rest.
We can also see our conclusion summerized in the below correlation matrix created
revenue_corr = df_q2.copy(deep=True)
# 9 a create a correlation matrix
revenue_corr=df_q2.copy(deep=True)
revenue_corr['total_revenue'] = revenue_corr['total_revenue'].astype(str).astype('float64')
revenue_corr['total_budget'] = revenue_corr['total_budget'].astype(str).astype('float64')
revenue_corr.drop(columns=['id'], inplace=True)
revenue_corr =revenue_corr.corr()
# 9 b Plot the matrix using following configerations
plt.figure(figsize=(13,13))
revenue_corr_1=plt.matshow(revenue_corr,fignum=4)
plt.xticks(range(len(revenue_corr.columns)),revenue_corr.columns)
plt.yticks(range(len(revenue_corr.columns)),revenue_corr.columns)
plt.colorbar(revenue_corr_1,fraction=0.02)
plt.show();
High Postive correlation between
Low correlation between
Now having explored the relation between revenue and various other features, We shall explore more and see whether there are other trends present in the data.
Q2 - a) Which genres are the have the most popular genres over the years? How can I see these trends in the movies.
I have decided to use pie charts to describe the trends of the movies over the years.
# 10 a create new dataframe for this question.
df_q1 = df.copy(deep=True)
Cleaning the dataset according to requirments and only keeping those coloumns that are actually needed.
df_q1.drop(df_q1.columns.difference(['popularity', 'original_title','genres','release_year']), 1, inplace=True)
# 10 b Splitting the various genres using the delimeter
df_q1.genres = df_q1.genres.str.split('|')
The below function explode that I have used is taken from stackoverflow. Which allows me to seperate the various movie rows according to their multiple genres. So for example if I have a movie which is of action and drama genre. This function would allow me to split the row into 2 rows each having a single genre of 'drama' and 'action'
# 10 c splits joint genres into two seperate genres "i.e., romance/comedy -> romance, comedy"
def explode(df, lst_cols, fill_value=''):
# make sure `lst_cols` is a list
if lst_cols and not isinstance(lst_cols, list):
lst_cols = [lst_cols]
# all columns except `lst_cols`
idx_cols = df.columns.difference(lst_cols)
# calculate lengths of lists
lens = df[lst_cols[0]].str.len()
if (lens > 0).all():
# ALL lists in cells aren't empty
return pd.DataFrame({
col:np.repeat(df[col].values, lens)
for col in idx_cols
}).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
.loc[:, df.columns]
else:
# at least one list in cells is empty
return pd.DataFrame({
col:np.repeat(df[col].values, lens)
for col in idx_cols
}).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
.append(df.loc[lens==0, idx_cols]).fillna(fill_value) \
.loc[:, df.columns]
#10 d Dropping null values and calling the explode function
df_q1.dropna(inplace = True);
df_q1 = explode(df_q1, ['genres'])
# What are the most popular genres
Having extracted prelimenary results, we can now visualize which genre had the most movies over all the years. The below code and the subsequent graph does just that.
popular_genres = df_q1['genres'].value_counts()
popular_genres.plot('bar', figsize=(15,10), sort_columns='True', title='Popular Genres');
While the above graph does do a good job in explaning the overall trend of the genres popularity by summerizing the dataset, it does not show us the trends over the years where we can individually deduce and see which genre was most popular in one year and whether its popularity increased or decreased in the next or previous years. This year by year analysis cannot be done by the above bar chart therefore I decide to use pie charts which would enable us to indivually see genre trends over all the years.
# 11 a We first extract all the years present in the dataframe and sort them.
years = df_q1['release_year'].unique()
years.sort()
# 11 b We then loop over these subsequent years and extract the movie genre from that particular year and group them by
# their popularity(sum). We then plot each using a pie chart.
for i in years:
genre_popularity = df_q1[df_q1['release_year'] == i]
genre_popularity = (genre_popularity.groupby('genres')['popularity'].sum()).sort_values(ascending=False)
genre_popularity.plot(kind='pie', figsize=(7,7));
plt.title(i)
plt.show()
Q2 - b) Lets make another observation regarding voter count, the vote average and how it has changed over the years.
# 12 a
df_q3 = df.copy(deep=True)
df_q3 =df_q3.drop(df_q3.columns.difference(['vote_count','vote_average','release_year']), 1)
# 12 b group vote count by release year and find average
votes =(df_q3.groupby('release_year', sort=True).mean())
# y = df_q3.groupby('genres')['popularity'].sum()
votes.head(4)
# 12 c graph the results for vote count
votes['vote_count'].plot(title='Average Vote count over the years');
# 12 d graph the results for average vote
votes['vote_average'].plot( title='Average vote over the years');
We note that the vote average count has increased steadily over the years however the vote average has only seen a spike in the mid 70s after which there was a decline(albiet with myraid fluctuations).
Limitations:
Hence the biggest limitation of this dataset is the fact that it contains a limited number of values and any conclusion that will be drawm would lack reliablity since the underlaying data is limited.
The second limitation in this IMDB dataset is the fact that there is also noise in the data. While I have been able to in various occasions remove outliers. This was always not possible considering the limited skill set I have.
Conclusions: While I have given my conclusions of the various features, the dataset itself and any correlations on various occassions in the study, I shall summerize my findings below.
To anwser the signifcant question we asked in the beginning of our analyis that
'What features in the data are correlated to or effect the revenue of the movie?' (Question 1)
We find (via Exploratory Data Analysis) that:
There is a significant correlation between the movie budget and revenue. The data suggests that on avearge, movies with a higher budget tend to perform better than those with a lower budget. However, the signifcant limitation when coming to a reliable conclusion was the lack of data we had. After cleaning the coloumn of zero values we were left with half of the orignial values. That coupled with the fact that there were a significantly higher amount of missing values of the movies present in the early 50 to 70s.
Popularity and revenue are also correlated. This seems an almost predictable conclusion considering that a popularity of a movie generally tends to produce a higher revenue. We now have data to support this inference.
The movies directed by directors Peter Jackson, Steven Spielberg and Micheal Bay on average have a marked higher revenue than the remaining directors. It is also interesting to note that Peter Jackson though has directed a sizable number of movies, still has directed less than the top 15. This goes to suggest that he appears to be a competent director .
This however is not enough to come to a definite conclusion that all famous directors(or those who have directed a sizable number of movies) generate the highest revenue. Even though the data does show a relationship between top directors producing more revenues for a movie than other directors.
As mentioned numerous times earliers, any inference/observation which also takes in consideration the movie revenue is not very reliable considering the lower number of data.
Exploring other trends within the data (Question 2)
The following trends were found in the data:
Reference:
1)Please note that the 'explode' function used to seperate the movie genres from the genre coloumns was taken from the following stackoverflow question
https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows
2) I also drew inspiratin from the following Kaggle analysis. It was interesting to note that the original dataset contains far more features and the limitless possiblities of data wrangling that can be done.
https://www.kaggle.com/aninda123/imdb-movie-analysis